Artem/fix gpu count#379
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughContext is threaded into inventory objects; resource quantities are sanitized and clamped to zero when negative or underflowing; pod tracking keys become namespace-aware; watchers restart cleanly and node events filter by name; tests verify clamping, underflow behavior, and allocatable change detection across both inventory metrics and node discovery. ChangesResource Quantity Sanitization and Context Threading
Sequence DiagramsequenceDiagram
participant NodeWatcher
participant updateNodeInfo
participant sanitizeResourceQuantity
participant Metrics
participant clampUint64
NodeWatcher->>updateNodeInfo: ctx, node data
updateNodeInfo->>updateNodeInfo: zero-reset fields
updateNodeInfo->>sanitizeResourceQuantity: node allocatable
alt negative value
sanitizeResourceQuantity->>sanitizeResourceQuantity: log error
sanitizeResourceQuantity->>updateNodeInfo: return 0
else valid
sanitizeResourceQuantity->>updateNodeInfo: return clamped
end
updateNodeInfo->>updateNodeInfo: store sanitized values
NodeWatcher->>Metrics: compute inventory metrics
Metrics->>clampUint64: allocatable total
clampUint64->>Metrics: return clamped value
Metrics->>Metrics: return metric set
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
operator/inventory/node-discovery.go (1)
541-546:⚠️ Potential issue | 🟠 MajorRefresh node quantities on every
Added/Modifiedevent.This guard skips
updateNodeInfo()for capacity-only updates, and it also misses sub-core CPU changes becausenodeAllocatableChanged()relies onresource.Quantity.Value(). That leaves stale allocatable/capacity data innode, so the new sanitization/clamping path is bypassed on a normal node update.restartPodsWatcher()can stay behind change detection if you want, butupdateNodeInfo()itself should run for everywatch.Added/watch.Modified.🛠️ One safe way to split the refresh from the watcher restart
switch evt.Type { case watch.Added: fallthrough case watch.Modified: - if evt.Type == watch.Added || (knode != nil && nodeAllocatableChanged(knode, obj)) { - updateNodeInfo(ctx, obj, &node) + updateNodeInfo(ctx, obj, &node) + if evt.Type == watch.Added || (knode != nil && nodeAllocatableChanged(knode, obj)) { if err = restartPodsWatcher(); err != nil { return err } } signalLabels()In `k8s.io/apimachinery/pkg/api/resource`, does `Quantity.Value()` preserve milli-CPU precision, or does a quantity like `500m` become `0`? What comparison should be used to detect CPU allocatable changes without truncation?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@operator/inventory/node-discovery.go` around lines 541 - 546, Always call updateNodeInfo(...) for every watch.Added or watch.Modified event instead of skipping it for capacity-only updates; keep nodeAllocatableChanged(...) solely to decide whether to call restartPodsWatcher(), not to gate the update. Modify the event handling in the Added/Modified case so updateNodeInfo(ctx, obj, &node) is invoked unconditionally when evt.Type is Added or Modified, and only if nodeAllocatableChanged(knode, obj) returns true then call restartPodsWatcher(). Update nodeAllocatableChanged(...) to compare CPU using Quantity.MilliValue() (or Quantity.AsDec/Compare if using decimals) to avoid truncation of sub-core (milli) CPU values and use proper Quantity.Cmp/Equal semantics for other resources.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cluster/kube/operators/clients/inventory/client_test.go`:
- Around line 406-408: The current test only compares
m.TotalAllocatable.Storage["default"] == 0 which false-passes if the "default"
key is missing; update the check in the anonymous check func to first assert the
map contains the "default" key (e.g. require.Contains/ok :=
m.TotalAllocatable.Storage["default"]) and then assert the value equals
uint64(0), referencing the existing m.TotalAllocatable.Storage map and the
"default" key.
In `@operator/inventory/node-discovery.go`:
- Around line 718-724: The current guard skips clamping when sanitized
gpuCapacity is 0, so change the logic in the block using gpuAllocatable and
gpuCapacity (variables gpuAllocatable, gpuCapacity and the
node.Resources.GPU.Quantity.Allocatable/Capacity fields) to clamp whenever
gpuAllocatable > gpuCapacity (remove the gpuCapacity > 0 check), call
node.Resources.GPU.Quantity.Allocatable.Set(gpuCapacity) and emit the same
log.Error with errGPUExceedsCapacity and context ("node", knode.Name,
"allocatable", gpuAllocatable, "capacity", gpuCapacity) so nodes with a
normalized capacity of 0 are correctly clamped.
- Around line 685-717: updateNodeInfo currently only updates fields present in
the incoming Allocatable/Capacity maps, leaving stale values for resources that
disappeared; before the two for-loops, explicitly reset the fields this function
owns: set node.Resources.CPU.Quantity.Allocatable.SetMilli(0) and
.Capacity.SetMilli(0), node.Resources.Memory.Quantity.Allocatable.Set(0) and
.Capacity.Set(0), node.Resources.EphemeralStorage.Quantity.Allocatable.Set(0)
and .Capacity.Set(0) (or .EphemeralStorage if that field name differs), and
node.Resources.GPU.Quantity.Allocatable.Set(0) and .Capacity.Set(0); then run
the existing loops to repopulate only current values so removed extended
resources won’t leave stale non-zero entries.
---
Duplicate comments:
In `@operator/inventory/node-discovery.go`:
- Around line 541-546: Always call updateNodeInfo(...) for every watch.Added or
watch.Modified event instead of skipping it for capacity-only updates; keep
nodeAllocatableChanged(...) solely to decide whether to call
restartPodsWatcher(), not to gate the update. Modify the event handling in the
Added/Modified case so updateNodeInfo(ctx, obj, &node) is invoked
unconditionally when evt.Type is Added or Modified, and only if
nodeAllocatableChanged(knode, obj) returns true then call restartPodsWatcher().
Update nodeAllocatableChanged(...) to compare CPU using Quantity.MilliValue()
(or Quantity.AsDec/Compare if using decimals) to avoid truncation of sub-core
(milli) CPU values and use proper Quantity.Cmp/Equal semantics for other
resources.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0116d940-8636-4609-8a76-8b97ce343688
📒 Files selected for processing (5)
cluster/kube/operators/clients/inventory/client.gocluster/kube/operators/clients/inventory/client_test.gocluster/kube/operators/clients/inventory/inventory.gooperator/inventory/node-discovery.gooperator/inventory/node-discovery_test.go
45b5af0 to
7edb230
Compare
Node modified events must refresh inventory state even when pod watcher restart is not required. The restart predicate now uses quantity comparison so milli-CPU allocatable changes are not missed. Signed-off-by: Joseph Chalabi <chalabi.joseph@gmail.com>
Use Kubernetes namespaced names for pod accounting keys and simplify node watch event cases. Clarify why node status refresh resets only the fields it owns. Signed-off-by: Joseph Chalabi <chalabi.joseph@gmail.com>
Resolves:
akash-network/support#429
Details
I haven't found a bug in how do we handle inventory events.
Also, I've found recent report about GPU driver reporting wrong values.
That is why I decided to add a guardrails that prevents uint underflow, also I added logging, so we can track this in the future.
Changes
Testing
Tested locally at non-gpu machine and checked if CPU is consistently changed compared to the main branch.
Checked CPU and memory changes at each stages:
Create a provider -> Create a deployment -> Create a lease -> Send a manifest -> Delete a deployment.